Procedures and Functions Tutorial

MLDB is the Machine Learning Database, and all machine learning operations are done via Procedures and Functions. Training a model happens via Procedures, and applying a model happens via Functions.

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.


In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")
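As a rough illustration of what such a REST call looks like, the sketch below builds (but does not send) a JSON PUT request of the kind the cells in this tutorial issue. It uses only the Python 3 standard library. The helper name build_put and the procedure name "example" are hypothetical, the Content-Type header is an assumption, and none of this is pymldb's actual implementation.

```python
import json
import urllib.request

def build_put(host, path, payload):
    """Build (without sending) a JSON PUT request for a REST endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=host + path,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},  # assumed header
    )

# "example" is a made-up procedure name used only for illustration.
req = build_put("http://localhost", "/v1/procedures/example", {"type": "import.text"})
print(req.get_method(), req.full_url)
```

Nothing is sent over the network here; passing the request to urllib.request.urlopen would perform the call against a running server.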

Loading a Dataset

The classic Iris Flower Dataset isn't very big, but it's well-known and easy to reason about, which makes it a good dataset for machine learning examples.

We can import it directly from a remote URL:


In [2]:
mldb.put('/v1/procedures/import_iris', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/iris.data",
        "headers": [ "sepal length", "sepal width", "petal length", "petal width", "class" ],
        "outputDataset": "iris",
        "runOnCreation": True
    }
})


Out[2]:
PUT http://localhost/v1/procedures/import_iris
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-03-22T16:20:12.7195733Z", 
      "status": {
        "numLineErrors": 0
      }, 
      "runFinished": "2016-03-22T16:20:13.0135105Z", 
      "id": "2016-03-22T16:20:12.719491Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "headers": [
        "sepal length", 
        "sepal width", 
        "petal length", 
        "petal width", 
        "class"
      ], 
      "outputDataset": "iris", 
      "runOnCreation": true, 
      "dataFileUrl": "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    }, 
    "type": "import.text", 
    "id": "import_iris"
  }, 
  "state": "ok", 
  "type": "import.text", 
  "id": "import_iris"
}

A quick look at the data

We can use the Query API to get the data into a Pandas DataFrame to take a quick look at it.


In [3]:
df = mldb.query("select * from iris")
df.head()


Out[3]:
_rowName  sepal length  sepal width  petal length  petal width  class
97        5.7           2.9          4.2           1.3          Iris-versicolor
11        5.4           3.7          1.5           0.2          Iris-setosa
112       6.4           2.7          5.3           1.9          Iris-virginica
134       6.3           2.8          5.1           1.5          Iris-virginica
142       6.9           3.1          5.1           2.3          Iris-virginica

In [4]:
%matplotlib inline
import seaborn as sns, pandas as pd

sns.pairplot(df, hue="class", size=2.5)


Out[4]:
<seaborn.axisgrid.PairGrid at 0x7f23b54d3410>

Unsupervised Machine Learning with a kmeans.train Procedure

We will create and run a Procedure of type kmeans.train. This trains an unsupervised K-Means model on the four measurement columns (the class column is excluded) and writes the cluster assigned to each input row to the output dataset.


In [5]:
mldb.put('/v1/procedures/iris_train_kmeans', {
    'type' : 'kmeans.train',
    'params' : {
        'trainingData' : 'select * EXCLUDING(class) from iris',
        'outputDataset' : 'iris_clusters',
        'numClusters' : 3,
        'metric': 'euclidean',
        "runOnCreation": True
    }
})


Out[5]:
PUT http://localhost/v1/procedures/iris_train_kmeans
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-03-22T16:20:18.0258212Z", 
      "runFinished": "2016-03-22T16:20:18.030994Z", 
      "id": "2016-03-22T16:20:18.025736Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "trainingData": "select * EXCLUDING(class) from iris", 
      "metric": "euclidean", 
      "outputDataset": "iris_clusters", 
      "numClusters": 3, 
      "runOnCreation": true
    }, 
    "type": "kmeans.train", 
    "id": "iris_train_kmeans"
  }, 
  "state": "ok", 
  "type": "kmeans.train", 
  "id": "iris_train_kmeans"
}
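To build intuition for what a K-Means procedure computes, here is a minimal pure-Python sketch of the standard algorithm (Lloyd's iteration) with a Euclidean metric. It is not MLDB's implementation: the four toy points and the fixed initial centroids are made up for illustration, and real implementations choose initial centroids differently.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda k: euclidean(point, centroids[k]))

def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # Final assignments against the final centroids.
    return centroids, [nearest(p, centroids) for p in points]

# Toy 2-D data (hypothetical, not the Iris measurements).
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (6.0, 6.0)])
print(assignments)  # [0, 0, 1, 1]
```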

Now we can look at the output dataset and compare the clusters the model learned with the three types of flower in the dataset.


In [6]:
mldb.query("""
    select pivot(class, num) as *
    from (
        select cluster, class, count(*) as num
        from merge(iris_clusters, iris)
        group by cluster, class
    )
    group by cluster
""")


Out[6]:
_rowName  Iris-setosa  Iris-versicolor  Iris-virginica
[0]       50           NaN              NaN
[1]       NaN          2                36
[2]       NaN          48               14

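As you can see, the K-means algorithm doesn't do a great job of clustering this data (as is mentioned in the Wikipedia article!). Cluster 0 contains exactly the 50 Iris-setosa rows, but Iris-versicolor and Iris-virginica are mixed across clusters 1 and 2.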
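One way to put a number on this is cluster purity: label each cluster with its majority class and count the fraction of rows that match. The sketch below computes it from the counts in the pivot table above (NaN cells are treated as zero and omitted). Purity is offered here as an illustrative measure; it is not something the kmeans.train output reports. The code assumes Python 3 division.

```python
# Counts copied from the pivot table above: cluster -> {class: count}.
counts = {
    0: {"Iris-setosa": 50},
    1: {"Iris-versicolor": 2, "Iris-virginica": 36},
    2: {"Iris-versicolor": 48, "Iris-virginica": 14},
}

total = sum(sum(row.values()) for row in counts.values())
# Rows that belong to the majority class of their cluster.
majority = sum(max(row.values()) for row in counts.values())
purity = majority / total

print(total, majority, round(purity, 3))  # 150 134 0.893
```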

Supervised Machine Learning with classifier.train and .test Procedures

We will now create and run a Procedure of type classifier.train. The configuration below trains a decision tree on roughly 20% of the data (the rows where rowHash() % 5 = 0) to classify rows into the three classes of Iris. The output of this procedure is a Function, which we will be able to call from REST or SQL.


In [7]:
mldb.put('/v1/procedures/iris_train_classifier', {
    'type' : 'classifier.train',
    'params' : {
        'trainingData' : """
            select 
                {* EXCLUDING(class)} as features, 
                class as label 
            from iris 
            where rowHash() % 5 = 0
        """,
        "algorithm": "dt",
        "modelFileUrl": "file://models/iris.cls",
        "mode": "categorical",
        "functionName": "iris_classify",
        "runOnCreation": True
    }
})


Out[7]:
PUT http://localhost/v1/procedures/iris_train_classifier
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-03-22T16:20:18.0753982Z", 
      "runFinished": "2016-03-22T16:20:18.0805328Z", 
      "id": "2016-03-22T16:20:18.075322Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "functionName": "iris_classify", 
      "trainingData": "\n            select \n                {* EXCLUDING(class)} as features, \n                class as label \n            from iris \n            where rowHash() % 5 = 0\n        ", 
      "modelFileUrl": "file://models/iris.cls", 
      "runOnCreation": true, 
      "mode": "categorical", 
      "algorithm": "dt"
    }, 
    "type": "classifier.train", 
    "id": "iris_train_classifier"
  }, 
  "state": "ok", 
  "type": "classifier.train", 
  "id": "iris_train_classifier"
}
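The where clause in the training query is a hash-based split: a row lands in the training set when a hash of the row falls in one bucket out of five, and the complementary condition selects the test set. The sketch below shows the same idea in plain Python. It uses MD5 of the row name as a stand-in hash, which is an assumption: MLDB's rowHash() is its own function, so the specific rows chosen will differ. The row names "0" to "149" are hypothetical.

```python
import hashlib

def bucket(row_name, num_buckets=5):
    """Map a row name to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(str(row_name).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical row names standing in for the 150 Iris rows.
row_names = [str(i) for i in range(150)]

train = [r for r in row_names if bucket(r) == 0]   # like "% 5 = 0"
test = [r for r in row_names if bucket(r) != 0]    # like "% 5 != 0"

print(len(train), len(test))
```

Because the bucket depends only on the row name, the split is reproducible across runs and the two sets never overlap, but the training share is only approximately one fifth.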

We can now test the classifier we just trained on the subset of the data we didn't use for training. To do so we use a procedure of type classifier.test.


In [8]:
rez = mldb.put('/v1/procedures/iris_test_classifier', {
    'type' : 'classifier.test',
    'params' : {
        'testingData' : """
            select 
                iris_classify({
                    features: {* EXCLUDING(class)}
                }) as score,
                class as label 
            from iris 
            where rowHash() % 5 != 0
        """,
        "mode": "categorical",
        "runOnCreation": True
    }
})

# Keep the statistics reported by the first run for the cells below.
runResults = rez.json()["status"]["firstRun"]["status"]
print(rez)


<Response [201]>

The procedure returns a confusion matrix, which you can compare with the one that resulted from the K-means procedure.


In [9]:
pd.DataFrame(runResults["confusionMatrix"])\
    .pivot_table(index="actual", columns="predicted", fill_value=0)


Out[9]:
                 count
predicted        Iris-setosa  Iris-versicolor  Iris-virginica
actual
Iris-setosa      40           0                0
Iris-versicolor  0            37               2
Iris-virginica   0            6                38

As you can see, the decision tree does a much better job of classifying the data than the K-means model, even though it was trained on only about 20% of the examples.

The procedure also returns standard classification statistics on how the classifier performed on the test set. Below are performance statistics for each label:


In [10]:
pd.DataFrame.from_dict(runResults["labelStatistics"]).transpose()


Out[10]:
                 f         precision  recall    support
Iris-setosa      1.000000  1.000000   1.000000  40
Iris-versicolor  0.902439  0.860465   0.948718  39
Iris-virginica   0.904762  0.950000   0.863636  44
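These figures can be reproduced by hand from the confusion matrix above. The sketch below assumes the usual definitions: precision is the diagonal count divided by the column total, recall is the diagonal count divided by the row total (the support), and f is their harmonic mean. Under those assumptions the results agree with the table to six decimal places. The code assumes Python 3 division.

```python
# Confusion matrix copied from the output above: actual -> predicted -> count.
confusion = {
    "Iris-setosa":     {"Iris-setosa": 40, "Iris-versicolor": 0,  "Iris-virginica": 0},
    "Iris-versicolor": {"Iris-setosa": 0,  "Iris-versicolor": 37, "Iris-virginica": 2},
    "Iris-virginica":  {"Iris-setosa": 0,  "Iris-versicolor": 6,  "Iris-virginica": 38},
}

def label_stats(confusion, label):
    """Precision, recall, F-score and support for one label."""
    true_positives = confusion[label][label]
    predicted = sum(row[label] for row in confusion.values())  # column total
    support = sum(confusion[label].values())                   # row total
    # Note: this sketch does not guard against a label that is never predicted.
    precision = true_positives / predicted
    recall = true_positives / support
    f = 2 * precision * recall / (precision + recall)
    return {"f": f, "precision": precision, "recall": recall, "support": support}

stats = {label: label_stats(confusion, label) for label in confusion}
for label, values in stats.items():
    print(label, values)
```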

They are also available averaged over all labels, with each label weighted by its support:


In [11]:
pd.DataFrame.from_dict({"weightedStatistics": runResults["weightedStatistics"]})


Out[11]:
           weightedStatistics
f          0.934997
precision  0.937871
recall     0.934959
support    123.000000
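The weighted figures are consistent with a support-weighted mean of the per-label statistics. The sketch below recomputes them from the rounded per-label values shown earlier, so the results agree with the table only up to rounding. The code assumes Python 3 division.

```python
# Per-label statistics copied from the earlier table (rounded to 6 decimals).
label_statistics = {
    "Iris-setosa":     {"f": 1.000000, "precision": 1.000000, "recall": 1.000000, "support": 40},
    "Iris-versicolor": {"f": 0.902439, "precision": 0.860465, "recall": 0.948718, "support": 39},
    "Iris-virginica":  {"f": 0.904762, "precision": 0.950000, "recall": 0.863636, "support": 44},
}

total_support = sum(s["support"] for s in label_statistics.values())

# Each metric is averaged with the label's support as its weight.
weighted = {
    metric: sum(s[metric] * s["support"] for s in label_statistics.values()) / total_support
    for metric in ("f", "precision", "recall")
}
weighted["support"] = total_support

print(weighted)
```

The weighted recall is also the overall accuracy on the test set: 115 of the 123 test rows lie on the diagonal of the confusion matrix.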

Scoring new examples

We can call the Function REST API endpoint to classify a never-before-seen set of measurements like this:


In [12]:
mldb.get('/v1/functions/iris_classify/application', input={
    "features":{
        "petal length": 1,
        "petal width": 2,
        "sepal length": 3,
        "sepal width": 4
    }
})


Out[12]:
GET http://localhost/v1/functions/iris_classify/application?input=%7B%22features%22%3A+%7B%22sepal+width%22%3A+4%2C+%22petal+width%22%3A+2%2C+%22petal+length%22%3A+1%2C+%22sepal+length%22%3A+3%7D%7D
200 OK
{
  "output": {
    "scores": [
      [
        "\"Iris-setosa\"", 
        [
          1, 
          "-Inf"
        ]
      ], 
      [
        "\"Iris-versicolor\"", 
        [
          0, 
          "-Inf"
        ]
      ], 
      [
        "\"Iris-virginica\"", 
        [
          0, 
          "-Inf"
        ]
      ]
    ]
  }
}
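The sketch below shows, using only the Python 3 standard library, how a query string like the one in the GET line above can be built and how the predicted label can be read out of the response. The response is copied from the output above rather than fetched from a server. Taking the first number in each pair as the score is an assumption based on this output; the meaning of the second element ("-Inf") is not covered here.

```python
import json
from urllib.parse import urlencode

features = {"petal length": 1, "petal width": 2, "sepal length": 3, "sepal width": 4}

# The input is sent as a JSON document in a URL-encoded query parameter.
query = urlencode({"input": json.dumps({"features": features})})
url = "http://localhost/v1/functions/iris_classify/application?" + query

# Response body copied from the output above.
response = {"output": {"scores": [
    ["\"Iris-setosa\"", [1, "-Inf"]],
    ["\"Iris-versicolor\"", [0, "-Inf"]],
    ["\"Iris-virginica\"", [0, "-Inf"]],
]}}

# Each entry is [JSON-encoded label, [score, ...]]; keep the highest score.
best = max(response["output"]["scores"], key=lambda entry: entry[1][0])
predicted = json.loads(best[0])  # strip the extra layer of JSON quoting

print(predicted)  # Iris-setosa
```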

Where to next?

Check out the other Tutorials and Demos.

You can also take a look at the classifier.experiment procedure type that can be used to train and test a classifier in a single call.

